The code for this work can be found here. This document is a summary of the ideas behind the project.
We obtained the data from the website propiedades.com. For each apartment we collected the address, number of bedrooms, bathrooms, half bathrooms, area in \(m^2\), maintenance fee, rent price, and whether the rent price is quoted in MXN or USD. We then cleaned the raw data and removed duplicates. After that we used the ggmap package to obtain the latitude and longitude of each apartment (validating the address at the same time). We then converted all rent prices to a single currency (MXN) and finally imputed missing values for the number of bathrooms and the \(m^2\), since those fields were missing for some apartments.
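The cleaning steps above can be sketched roughly as follows. This is a minimal sketch, not the repository's actual code: the column names `moneda` and the exchange rate `usd_to_mxn` are illustrative assumptions, and the imputation shown is simple median imputation.

```r
library(dplyr)

# Hypothetical cleaning pipeline; 'moneda' and the exchange rate are
# illustrative, not the exact names/values used in the project.
clean_listings <- function(raw, usd_to_mxn = 18.5) {
  raw %>%
    distinct() %>%                                        # remove duplicate listings
    mutate(precio = ifelse(moneda == 'USD',
                           precio * usd_to_mxn,           # homologate currency to MXN
                           precio)) %>%
    mutate(banios.m = ifelse(is.na(banios.m),             # simple median imputation
                             median(banios.m, na.rm = TRUE), banios.m),
           const.m  = ifelse(is.na(const.m),
                             median(const.m, na.rm = TRUE), const.m))
}

# Geocoding step (requires ggmap and a Google API key), run separately:
# datos <- ggmap::mutate_geocode(datos, dir)
```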
A random sample of the data frame we use looks as follows:
library(dplyr)
library(knitr)
datos <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/datos_finales.rds')
kable(datos %>%
dplyr::select(dir,rec,banios.m,const.m,est,precio,lon,lat)%>%
sample_n(20))
| | dir | rec | banios.m | const.m | est | precio | lon | lat |
|---|---|---|---|---|---|---|---|---|
| 862 | Rio Lerma 76 Centro Cuauhtémoc 06500 Distrito Federal | 1 | 1 | 56.00000 | 0 | 18500 | -99.16521 | 19.43011 |
| 358 | Bosque De Jazmines Bosque de las Lomas 11700 Distrito Federal | 1 | 1 | 80.00000 | 0 | 12500 | -99.24378 | 19.40943 |
| 797 | Av Paseo de la Reforma Juárez 06600 Distrito Federal | 1 | 1 | 171.94409 | 0 | 20000 | -99.19373 | 19.42571 |
| 110 | Arquímedes Polanco V Sección 11560 Distrito Federal | 3 | 3 | 166.00000 | 2 | 32000 | -99.19196 | 19.42713 |
| 378 | Sierra Gorda Lomas de Chapultepec III Sección 11000 Distrito Federal | 3 | 3 | 350.00000 | 3 | 120000 | -99.22237 | 19.42814 |
| 513 | Concepción Beistegui Del Valle Centro 03100 Distrito Federal | 2 | 2 | 190.00000 | 1 | 29000 | -99.16890 | 19.39033 |
| 534 | Dakota 412 Napoles 03810 Distrito Federal | 1 | 1 | 50.00000 | 0 | 10000 | -99.18011 | 19.38605 |
| 765 | Rio Po Poniente Cuauhtémoc 06500 Distrito Federal | 2 | 2 | 130.00000 | 0 | 18500 | -99.16833 | 19.42979 |
| 579 | Calle 18 San Pedro de los Pinos 03800 Distrito Federal | 2 | 2 | 71.00000 | 2 | 12800 | -99.18977 | 19.38747 |
| 741 | Reforma 222 222 Centro Juárez 06600 Distrito Federal | 2 | 2 | 90.00000 | 0 | 40000 | -99.16204 | 19.42909 |
| 432 | Avenida 96 San Pedro de los Pinos 03800 Distrito Federal | 2 | 2 | 80.00000 | 1 | 15000 | -99.18894 | 19.38570 |
| 185 | Mariano Escobedo 1 Anzures 11590 Distrito Federal | 3 | 4 | 250.00000 | 2 | 45000 | -99.18206 | 19.45008 |
| 315 | Lago Mask Los Manzanos 11460 Distrito Federal | 2 | 2 | 63.82000 | 1 | 8500 | -99.18458 | 19.44782 |
| 554 | Luz Saviñón Del Valle Norte 03103 Distrito Federal | 3 | 2 | 103.00000 | 1 | 15500 | -99.16682 | 19.39260 |
| 759 | Av Insurgentes 23 Centro San Rafael 06470 Distrito Federal | 1 | 1 | 70.00000 | 0 | 12000 | -99.15634 | 19.43881 |
| 878 | Tokio 18 Juárez 06600 Distrito Federal | 1 | 1 | 32.71142 | 0 | 10500 | -99.17064 | 19.42420 |
| 174 | Cofre De Perote 371 Lomas de Chapultepec I Sección 11000 Distrito Federal | 3 | 2 | 220.00000 | 2 | 33000 | -99.22588 | 19.42383 |
| 627 | Martín Mendalde 986 Del Valle Sur 03104 Distrito Federal | 2 | 2 | 110.00000 | 2 | 16500 | -99.16785 | 19.38092 |
| 17 | Monte Elbruz 1 Lomas de Chapultepec I Sección 11000 Distrito Federal | 3 | 4 | 210.00000 | 2 | 38000 | -99.20395 | 19.42962 |
| 411 | Moras 321 Del Valle Sur 03104 Distrito Federal | 3 | 2 | 200.00000 | 2 | 28000 | -99.17322 | 19.37640 |
Plotted over a map, the apartments look like this:
datos <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/datos_finales.rds')
latq <- quantile(datos$lat,c(0.38))
lonq <- quantile(datos$lon,c(0.28))
df <- get_map(location = c(lonq,latq),
maptype = 'roadmap', zoom = 12)
ggmap(df, extent = 'panel') +
geom_point(aes(x=lon,y=lat),colour = 'blue', alpha = 1,
size = 2, data = datos)
We want to look at the dispersion of rent prices across locations, so we make the following graph:
datos <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/datos_finales.rds')
qmplot(lon, lat, data=datos,
size=precio.c, alpha=I(0.8), color=precio_log,
geom=c('point'), source='google')
That graph shows that there are, in fact, some places where the most expensive apartments are concentrated. The idea is to apply spatial models to these data, so we plot the level curves of the prices to check that there is a zone where apartment rents are highest and that rent prices decrease as you move away from that zone.
fma <- precio_log ~ 1 + x + y + I(x*y) + I(x^2) + I(y^2) +
I(x^2*y) + I(x*y^2) + I(x^3) + I(y^3)
mod.1 <- lm(formula = fma, data = datos)
summary(mod.1)
##
## Call:
## lm(formula = fma, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.54645 -0.33878 -0.03304 0.31756 1.90224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.032e+01 2.988e-02 345.226 < 2e-16 ***
## x -1.547e+01 1.700e+00 -9.098 < 2e-16 ***
## y 1.065e+01 1.599e+00 6.657 4.61e-11 ***
## I(x * y) 1.439e+02 4.407e+01 3.264 0.001134 **
## I(x^2) -1.241e+02 2.833e+01 -4.382 1.30e-05 ***
## I(y^2) -5.855e+02 3.800e+01 -15.407 < 2e-16 ***
## I(x^2 * y) -4.187e+03 1.236e+03 -3.388 0.000732 ***
## I(x * y^2) 6.516e+03 1.432e+03 4.549 6.05e-06 ***
## I(x^3) 1.438e+03 8.178e+02 1.759 0.078943 .
## I(y^3) -1.151e+04 1.104e+03 -10.424 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5416 on 990 degrees of freedom
## Multiple R-squared: 0.3608, Adjusted R-squared: 0.355
## F-statistic: 62.09 on 9 and 990 DF, p-value: < 2.2e-16
surf <- akima::interp(x=datos$x, y=datos$y, z=datos$precio_log,
xo = seq(min(datos$x), max(datos$x), length = 50),
yo = seq(min(datos$y), max(datos$y), length = 50),
linear = FALSE, extrap = TRUE,
duplicate = "median")
surf.r <- cbind(expand.grid(surf$x, surf$y), c(surf$z))
colnames(surf.r) <- c("x", "y", "z")
surf.r$z <- predict(mod.1, surf.r)
bks <- as.numeric(quantile(surf.r$z))
surf.r$pr <- as.factor(cut(surf.r$z,breaks = bks))
ggplot(surf.r, aes(x = x, y = y, z = z)) +
geom_tile(aes(fill = pr)) +
scale_fill_brewer(palette = "Blues") +
stat_contour()
The previous graph confirms that apartment rent prices decrease as we move away from the Polanco area; it also shows that nearby apartments have similar rent prices, so fitting a spatial model seems reasonable.
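A standard way to formalize this idea (the decomposition that kriging relies on; the notation here is ours, not from the repository) is to write the log rent price at a location \(s\) as

\[
\text{precio\_log}(s) = \mu(s) + \eta(s) + \epsilon(s),
\]

where \(\mu(s)\) is the large-scale trend (a regression on the coordinates and covariates), \(\eta(s)\) is a zero-mean spatially correlated small-scale process, and \(\epsilon(s)\) is uncorrelated noise. Below we estimate \(\mu\) by least squares and then model the spatial structure of \(\eta\) through a semivariogram.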
datos_2 <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/datos_finales.rds')
datos_2$xx<-with(datos_2,x+media_lon)
datos_2$yy<-with(datos_2,y+media_lat)
We now proceed to the model. For this we use the number of bathrooms, bedrooms, the \(m^2\) and the parking spaces as covariates.
datos <- datos_2 %>%
dplyr::select(rec,banios.m,const.m,est,precio_log,x,y)
# Kriging
# Estimation of the large-scale variation
# Linear trend in the coordinates plus covariates
fma <- precio_log ~ 1 + x + y + rec + banios.m + const.m + est
mod.1 <- lm(formula = fma, data = datos)
summary(mod.1)
##
## Call:
## lm(formula = fma, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81971 -0.23653 0.00519 0.26291 1.58607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.265328 0.039863 232.430 < 2e-16 ***
## x -2.934965 0.559000 -5.250 1.86e-07 ***
## y 3.242202 0.550129 5.894 5.17e-09 ***
## rec -0.134960 0.020497 -6.585 7.38e-11 ***
## banios.m 0.210040 0.022063 9.520 < 2e-16 ***
## const.m 0.003588 0.000191 18.790 < 2e-16 ***
## est 0.057837 0.014082 4.107 4.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4093 on 993 degrees of freedom
## Multiple R-squared: 0.6338, Adjusted R-squared: 0.6316
## F-statistic: 286.4 on 6 and 993 DF, p-value: < 2.2e-16
# Estimation of the small-scale variation
datos$mu_hat <- mod.1$fitted.values
datos <- mutate(datos, eta_hat = precio_log - mu_hat)
renta_datos <- data.frame(x = datos$x, y = datos$y, eta_hat = datos$eta_hat)
coordinates(renta_datos) = ~ x + y
emp_variog <- variogram(eta_hat ~ 1, data = renta_datos, width = 0.001)
For the model we need to fit the empirical semivariogram (it gives us information about the average rate at which price differences change with distance) in order to check for spatial dependence in the apartment rent prices. Alongside the empirical semivariogram we plot the theoretical one; we fit a spherical semivariogram because we believe the spatial dependence decreases approximately linearly as the distance increases.
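For reference, the spherical semivariogram has the closed form (with nugget \(\tau^2\), partial sill \(\sigma^2\), and range \(\phi\)):

\[
\gamma(h) = \begin{cases}
\tau^2 + \sigma^2\left(\dfrac{3h}{2\phi} - \dfrac{h^3}{2\phi^3}\right), & 0 < h < \phi,\\[4pt]
\tau^2 + \sigma^2, & h \ge \phi,
\end{cases}
\]

which is what the helper function below evaluates pointwise: the semivariance rises until the range \(\phi\) and then flattens at the sill \(\tau^2 + \sigma^2\).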
sph.variog <- function(sigma2, phi, tau2, dist){
  # Spherical semivariogram: nugget tau2, partial sill sigma2, range phi.
  # Rises until the range, then flattens at the sill tau2 + sigma2.
  ifelse(dist < phi,
         tau2 + sigma2 * ((3 * dist) / (2 * phi) - dist^3 / (2 * phi^3)),
         tau2 + sigma2)
}
sph_variog.w <- fit.variogram(emp_variog,
vgm(psill= 0.05, model="Sph", range = 0.03, nugget = 0.1),
fit.method = 7)
sigma2 <- sph_variog.w$psill[2]
phi <- sph_variog.w$range[2]
tau2 <- sph_variog.w$psill[1]
dist <- sph_variog.w$dist # distance vector
dat <- data.frame(x = seq(0, 0.06, 0.0001), y = seq(0, 0.12, 0.0002))
ggplot(dat, aes(x = x, y = y)) + ylim(0, 0.2) +
labs(title = expression("Fitted spherical semivariogram"),
x = "distance", y = "semivariance") +
stat_function(fun = sph.variog, args = list(sigma2 = sigma2, phi = phi, tau2 = tau2),
colour = "green3") +
geom_point(data = emp_variog, aes(x = dist, y = gamma))
The kriging model is fitted as follows:
# Estimation of beta using the estimated covariance matrix
trend <- ~ 1 + datos$x + datos$y + datos$rec +
datos$banios.m + datos$const.m + datos$est
depas_geo <- as.geodata(obj = datos, coords.col = c(6, 7), data.col = 5)
depas_reml <- likfit(depas_geo, trend = trend, cov.model = "spherical",
ini.cov.pars = c(sigma2, phi), nugget = tau2,
fix.nugget = FALSE, lik.met = "REML")
saveRDS(depas_reml,'depas_reml.rds')
The \(\beta\)’s of the model are:
depas_reml <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/depas_reml.rds')
kable(depas_reml$beta)
| parameter | estimate |
|---|---|
| intercept | 8.9575888 |
| covar1 | -4.6482967 |
| covar2 | -3.9731080 |
| covar3 | -0.0017291 |
| covar4 | 0.1233528 |
| covar5 | 0.0027884 |
| covar6 | 0.0569030 |
kable(mod.1$coeff)
| parameter | estimate |
|---|---|
| (Intercept) | 9.2653279 |
| x | -2.9349652 |
| y | 3.2422019 |
| rec | -0.1349601 |
| banios.m | 0.2100401 |
| const.m | 0.0035882 |
| est | 0.0578370 |
sigma2.reml <- depas_reml$sigmasq
phi.reml <- depas_reml$phi
tau2.reml <- depas_reml$tausq
#Kriging
kc.control <- krige.control(type.krige = "ok",
trend.d = trend,
trend.l = trend,
obj.model=depas_reml,
cov.model = "spherical",
cov.pars=c(sigma2.reml, phi.reml),
nugget = tau2.reml)
loc <- matrix(c(datos$x,datos$y), nrow=nrow(datos), ncol=2)
# Prediction
kc.sk.s0 <- krige.conv(depas_geo,
locations=loc,
krige=kc.control)
## krige.conv: model with mean defined by covariates provided by the user
## krige.conv: Kriging performed using global neighbourhood
datos$ks_precio<-kc.sk.s0$predict
The predictions obtained look like this:
datos <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/datos_prediccion.rds')
datos$pks_precio.c <- quantileCut(datos$ks_precio,5)
ggplot(datos, aes(x = x, y = y, colour = pks_precio.c)) +
geom_point(size = 2.3) +
scale_colour_brewer(palette = "Blues")
mean_lon <- -99.18264
mean_lat <- 19.41507
aux <- datos %>%
dplyr::select(x, y, ks_precio,pks_precio.c) %>%
mutate(lon = x + mean_lon, lat= y + mean_lat )
qmplot(lon, lat, data=aux,
size=pks_precio.c, alpha=I(0.8), color=ks_precio,
geom=c('point'), source='google')
We now proceed to the Bayesian version of kriging. For this we need priors for the sill, nugget, and range: for the sill and the nugget we use Inverse-Gamma priors and for the range a Uniform distribution. With these priors and the data we fit the model using 5000 MCMC iterations.
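Written out, the hierarchical model being fitted is (standard spatial-regression notation; the prior hyperparameters shown are the ones passed to `spLM` below):

\[
Y \mid \beta, \eta, \tau^2 \sim N\!\left(X\beta + \eta,\; \tau^2 I\right), \qquad
\eta \mid \sigma^2, \phi \sim N\!\left(0,\; \sigma^2 R(\phi)\right),
\]

where \(R(\phi)\) is the spherical correlation matrix, with priors

\[
\sigma^2 \sim \mathrm{IG}(2, 1), \qquad \tau^2 \sim \mathrm{IG}(2, 0.5), \qquad \phi \sim \mathrm{Unif}(0.0008,\, 0.06).
\]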
datos <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/datos_finales.rds') %>%
dplyr::select(rec,banios.m,const.m,est,precio_log,x,y)
N <- nrow(datos)
coords <- as.matrix(cbind(datos$x,datos$y),nrow=N,ncol=2)
Y <- datos$precio_log
Dist.mat <- rdist(coords)
max(Dist.mat)
# Initial values
beta.ini <- rep(0,5)
sigma2.ini <- 1/rgamma(1,2,1)
tau2.ini <- 1/rgamma(1,2,0.5)
phi.ini <- runif(1,min=0.0008,max=0.06)
# Model fitting
fma <- Y ~ datos$x + datos$y + datos$rec + datos$banios.m +
datos$const.m + datos$est
model.1 <- spLM(formula = fma,
coords=coords,starting=list("phi"=phi.ini,"sigma.sq"=sigma2.ini, "tau.sq"=tau2.ini),
tuning=list("phi"=0.05, "sigma.sq"=0.1, "tau.sq"=0.05),
priors=list("phi.Unif"=c(0.0008, 0.06), "sigma.sq.IG"=c(2, 1),
"tau.sq.IG"=c(2, 0.5)), cov.model="spherical",
n.samples=5000, verbose=TRUE, n.report=100)
saveRDS(model.1,'model.rds')
model.1 <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/model.rds')
Some simulated draws of the covariance parameters look like this:
kable(model.1$p.theta.samples[1:5,])
| sigma.sq | tau.sq | phi |
|---|---|---|
| 0.3401744 | 0.4894106 | 0.0112989 |
| 0.3401744 | 0.4894106 | 0.0112989 |
| 0.3155258 | 0.4291275 | 0.0120334 |
| 0.3155258 | 0.4291275 | 0.0120334 |
| 0.2130951 | 0.3585916 | 0.0093130 |
The marginal posterior distributions of the parameters look like this:
# Plots and summary of the marginal posterior
# distribution of the covariance parameters
plot(model.1$p.theta.samples)
print(summary(model.1$p.theta.samples[1001:5000,]))
## sigma.sq tau.sq phi
## Min. : 47.71 Min. :0.05635 Min. :0.01391
## 1st Qu.: 98.64 1st Qu.:0.06979 1st Qu.:0.03513
## Median :121.51 Median :0.07325 Median :0.04810
## Mean :136.60 Mean :0.07334 Mean :0.04449
## 3rd Qu.:158.26 3rd Qu.:0.07648 3rd Qu.:0.05376
## Max. :491.97 Max. :0.09247 Max. :0.05931
The 95% probability intervals of the parameters are:
# 95% probability intervals for the covariance parameters
apply(model.1$p.theta.samples[1001:5000,],2,quantile,c(0.025,0.975))
## sigma.sq tau.sq phi
## 2.5% 70.66285 0.06476423 0.02036274
## 97.5% 270.07822 0.08329890 0.05820462
# Recover the regression coefficients with a burn-in of 1000
model.1.recov <- spRecover(model.1, start=1000, verbose=TRUE)
saveRDS(model.1.recov, 'model.1.recov.rds')
The marginal posterior distributions of the \(\beta\) coefficients look like this:
model.1.recov <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/model.1.recov.rds')
# Plots and summaries of the marginal posterior
# distribution of the beta coefficients (regression parameters)
plot(model.1.recov$p.beta.recover.samples[,1:3])
plot(model.1.recov$p.beta.recover.samples[,4:5])
plot(model.1.recov$p.beta.recover.samples[,6:7])
summary(model.1.recov$p.beta.recover.samples)
##
## Iterations = 1:4001
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 4001
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## (Intercept) 8.492824 1.170e+01 1.850e-01 2.006e-01
## datos$x -4.278745 1.032e+01 1.632e-01 1.632e-01
## datos$y -7.693852 1.102e+01 1.743e-01 1.783e-01
## datos$rec -0.004121 1.741e-02 2.752e-04 3.025e-04
## datos$banios.m 0.124279 1.767e-02 2.793e-04 2.793e-04
## datos$const.m 0.002797 1.586e-04 2.507e-06 2.445e-06
## datos$est 0.056882 1.073e-02 1.697e-04 1.697e-04
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## (Intercept) -14.507671 0.973703 8.635634 15.639209 31.374954
## datos$x -24.758131 -10.940044 -4.387169 2.396214 15.894705
## datos$y -30.459534 -15.069522 -7.477607 -0.285101 13.687851
## datos$rec -0.038788 -0.015407 -0.003949 0.007808 0.029787
## datos$banios.m 0.089849 0.112016 0.124376 0.136385 0.158942
## datos$const.m 0.002486 0.002687 0.002798 0.002901 0.003106
## datos$est 0.035514 0.049826 0.057021 0.064126 0.077463
# eta hat (small-scale spatial effects)
eta.hat <- t(model.1.recov$p.w.recover.samples)
model.1.eta.summary <- summary(mcmc(t(model.1.recov$p.w.recover.samples)))$quantiles[,c(3,1,5)]
model.1.eta.summary[1:5,]
## 50% 2.5% 97.5%
## var1 1.0915416 -21.59141 24.16618
## var2 1.1097061 -21.70512 24.20105
## var3 0.8599669 -22.06180 23.85060
## var4 1.1667923 -21.63231 24.11181
## var5 1.0549532 -21.80423 24.01024
We now proceed to the predictions:
# Make the prediction
# Covariate matrix
predcov <- matrix(cbind(rep(1,N),datos$x,datos$y,datos$rec,
datos$banios.m,datos$const.m,datos$est),
nrow=N,ncol=7)
# Uses composition sampling with MCMC from the
# posterior predictive distribution
# Burn-in is 1000 samples and predictions are kept every
# two samples (the thin parameter)
pred <- spPredict(model.1,
pred.coords=coords,
pred.covars=predcov,
start=1000,
thin=2)
saveRDS(pred, 'pred.rds')
We obtain the posterior mean of the predictions:
pred <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/pred.rds')
str(pred$p.y.predictive.samples)
## num [1:1000, 1:2001] 10.37 10 9.68 9.62 10.02 ...
## Posterior mean of the predictions.
post.pred.mean <- rowMeans(pred$p.y.predictive.samples)
(post.pred.mean[1:10])
## [1] 10.367222 9.998798 9.680344 9.615805 10.021271 9.798127 10.126631
## [8] 9.210340 10.043249 10.819778
# Kriging predictions
datos <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/datos_finales.rds')
datos_pred <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/datos_prediccion.rds')
# Linear regression
fma <- precio_log ~ 1 + x + y + rec + banios.m + const.m + est
mod.1 <- lm(formula = fma, data = datos)
summary(mod.1)
##
## Call:
## lm(formula = fma, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81971 -0.23653 0.00519 0.26291 1.58607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.265328 0.039863 232.430 < 2e-16 ***
## x -2.934965 0.559000 -5.250 1.86e-07 ***
## y 3.242202 0.550129 5.894 5.17e-09 ***
## rec -0.134960 0.020497 -6.585 7.38e-11 ***
## banios.m 0.210040 0.022063 9.520 < 2e-16 ***
## const.m 0.003588 0.000191 18.790 < 2e-16 ***
## est 0.057837 0.014082 4.107 4.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4093 on 993 degrees of freedom
## Multiple R-squared: 0.6338, Adjusted R-squared: 0.6316
## F-statistic: 286.4 on 6 and 993 DF, p-value: < 2.2e-16
# Bayesian kriging predictions
N <- 8100
xo <- seq(min(datos$x), max(datos$x), length = 90)
yo <- seq(min(datos$y), max(datos$y), length = 90)
grid <- akima::interp(x=datos$x, y=datos$y, z=datos$precio_log,
xo = xo, yo = yo,
linear = FALSE, extrap = TRUE,
duplicate = "median")
grid <- cbind(expand.grid(grid$x, grid$y), c(grid$z))
colnames(grid) <- c('x','y','z')
coords <- grid[,c('x','y')]
predcov <- matrix(cbind(rep(1,N),coords$x,coords$y,
sample(datos$rec,N,replace=T),
sample(datos$banios.m,N,replace=T),
rgamma(N,2.62,0.017),
sample(datos$est,N,replace=T)),
nrow=N,ncol=7)
model.1 <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/model.rds')
model.0 <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/depas_reml.rds')
sigma2.reml <- model.0$sigmasq
phi.reml <- model.0$phi
tau2.reml <- model.0$tausq
trend <- ~ 1 + datos$x + datos$y + datos$rec +
datos$banios.m + datos$const.m + datos$est
depas_geo <- as.geodata(obj = datos, coords.col = c(20, 21), data.col = 22)
trend_pred <- ~ 1 + coords$x + coords$y + predcov[,4] +
predcov[,5] + predcov[,6] + predcov[,7]
kc.control <- krige.control(type.krige = "ok",
trend.d = trend,
trend.l = trend_pred,
obj.model = model.0,
cov.model = "spherical",
cov.pars=c(sigma2.reml, phi.reml),
nugget = tau2.reml)
kc.sk.s0 <- krige.conv(depas_geo,
locations=coords,
krige=kc.control)
## krige.conv: model with mean defined by covariates provided by the user
## krige.conv: Kriging performed using global neighbourhood
pred_ok_media <- kc.sk.s0$predict
pred_ok_var <- kc.sk.s0$krige.var
With all that we can make the following plots:
pred_grid_bayesiano <- spPredict(model.1,
pred.coords=coords,
pred.covars=predcov,
start=500,
thin=10)
saveRDS(pred_grid_bayesiano,'pred_grid_bayesiano.rds')
N <- 8100
pred_grid_bayesiano <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/pred_grid_bayesiano.rds')
post_pred_bayesiano <- rowMeans(pred_grid_bayesiano$p.y.predictive.samples)
post_pred_95ci_bayes <- t(apply(pred_grid_bayesiano$p.y.predictive.samples, 1, quantile, c(0.025, 0.975)))
dat <- data.frame(id=1:N,media=post_pred_bayesiano,post_pred_95ci_bayes)
dat <- cbind(dat,predcov)
colnames(dat)<-c('id','media','lb_bayes','ub_bayes','int','x','y','rec','banios','const','est')
dat$lon <- dat$x + mean_lon
dat$lat <- dat$y + mean_lat
dat$ks_precio<-kc.sk.s0$predict
lonq <- median(dat$lon)
latq <- median(dat$lat)
df <- get_map(location = c(lonq,latq),
maptype = 'roadmap', zoom = 13)
saveRDS(df,'df.rds')
df <- readRDS('/Users/Gers/Dropbox/Data Incubator/Project/data/df.rds')
ggmap(df, extent = 'panel') +
geom_tile(data = dat,aes(x = lon, y = lat, z = media,
fill = media, alpha=1)) +
stat_contour(data=dat,aes(z = media),binwidth=10,size=0.5)+
scale_fill_gradient2(low = "#0000FF", mid = "#FFFFFF", high ="#FF0000",
midpoint = median(dat$media), space = "rgb", guide = "colourbar")
qmplot(lon, lat, data=dat,
size=media, alpha=I(0.5), color=media,
geom=c('point'), source='google') +
scale_color_gradient(low='pink', high='red')
In both of the previous plots we notice that Polanco and the surrounding areas have a higher mean rent price and that the rent price decreases as you move away.
# Add the ordinary kriging prediction intervals
dat$pred_ok <- pred_ok_media
dat$lb_ok <- pred_ok_media - 2*sqrt(pred_ok_var)
dat$ub_ok <- pred_ok_media + 2*sqrt(pred_ok_var)
ggplot(dat[sample(1:nrow(dat),100),]) +
geom_pointrange(aes(x = id, y=pred_ok,
ymin=lb_ok, ymax=ub_ok), color='red')+
geom_pointrange(aes(x = id, y=media,
ymin=lb_bayes, ymax=ub_bayes), color='blue')
In the previous graph we can see that the prediction intervals of the two models are similar but, in general, the Bayesian probability intervals are narrower.
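To quantify that visual impression, one can compare the average interval widths directly (a minimal sketch, assuming the `dat` data frame with the `lb_ok`/`ub_ok` and `lb_bayes`/`ub_bayes` columns built above):

```r
# Average prediction-interval width: ordinary kriging vs Bayesian kriging.
# A smaller mean width for the Bayesian intervals supports the claim above.
mean(dat$ub_ok - dat$lb_ok)
mean(dat$ub_bayes - dat$lb_bayes)
```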